This module builds on code contained in Coronavirus_Statistics_USAF_v006.Rmd. This file includes the latest code for analyzing data from USA Facts. USA Facts maintains data on cases and deaths by county for coronavirus in the US. Downloaded data are unique by county with date as a column and a separate file for each of cases, deaths, and population.
The intent of this code is to source updated functions that allow for readRunUSAFacts() to be run to obtain, read, process, and analyze data from USA Facts.
The tidyverse library is loaded, and the functions used for CDC daily processing are sourced. Additionally, specific functions for USA Facts are also sourced:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Functions are available in source file
source("./Generic_Added_Utility_Functions_202105_v001.R")
source("./Coronavirus_CDC_Daily_Functions_v002.R")
source("./Coronavirus_USAF_Functions_v001.R")
Further, the mapping file specific to USA Facts is sourced:
source("./Coronavirus_USAF_Default_Mappings_v002.R")
Updated functions for diagnoseClusters(), createDetailedSummaries(), createSummary(), and helperSummaryMap() are included in Coronavirus_USAF_Functions_v001.R. These functions should be checked for consistency with state-level data with just a single copy kept later.
The latest county-level burden data are downloaded:
readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20220913.csv",
"usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20220913.csv"
)
compareList <- list("usafCase"=readFromRDS("cty_newdata_20220807")$dfRaw$usafCase,
"usafDeath"=readFromRDS("cty_newdata_20220807")$dfRaw$usafDeath
)
# Use existing clusters
cty_newdata_20220913 <- readRunUSAFacts(maxDate="2022-09-11",
downloadTo=lapply(readList,
FUN=function(x) if(file.exists(x)) NA else x
),
readFrom=readList,
compareFile=compareList,
writeLog="./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log",
ovrwriteLog=TRUE,
useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
skipAssessmentPlots=FALSE,
brewPalette="Paired"
)
##
## No file has been downloaded, will use existing file: ./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20220913.csv
## Rows: 3193 Columns: 962
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (959): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 33
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 128 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
## Rows: 3193 Columns: 962
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (959): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 33
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 3 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 108 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType cases new_cases n
## <chr> <dbl> <dbl> <dbl>
## 1 before 3.45e+10 93305986 3058894
## 2 after 3.43e+10 91158859 3010036
## 3 pctchg 7.16e- 3 0.0230 0.0160
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType deaths new_deaths n
## <chr> <dbl> <dbl> <dbl>
## 1 before 5.12e+8 1034939 3058894
## 2 after 4.95e+8 964043 3010036
## 3 pctchg 3.23e-2 0.0685 0.0160
## NULL
# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20220913$useClusters,
caption="Includes only counties with 25k+ population",
brewPalette="viridis"
)
# Save the refreshed file
saveToRDS(cty_newdata_20220913, ovrWriteError=FALSE)
Vaccines data are also updated, though the process needs to integrate previous data:
cty_vaxdata_20220914 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20220914.csv",
ctyList=cty_newdata_20220913,
minDateCD=c("2022-06-09", "2022-06-09"),
maxDateCD="2022-09-01"
)
## Rows: 78961 Columns: 72
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (66): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
## state n
## <chr> <int>
## 1 AS 24
## 2 FM 25
## 3 GU 48
## 4 MH 24
## 5 MP 24
## 6 PR 1901
## 7 PW 24
## 8 VI 96
## 9 <NA> 17
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
##
## Count of NA records by column
## state FIPS popgte65_minpop popgte65_maxpop popgte65_nnA
## 0 0 0 0 0
## n
## 0
##
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## # maxpop <dbl>
## # ℹ Use `colnames()` to see all variable names
##
##
##
## Will run with parameters:
## burdenVar: cpm dpm
## vaxVar: vxcpoppct vxcpoppct
## minDateCD: 2022-06-09 2022-06-09
## maxDateCD: 2022-09-01 2022-09-01
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -289154037 -1505413 -90865 1548407 62031339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20807.226 2187.230 9.513 <2e-16 ***
## vaxMetric 8.061 33.809 0.238 0.812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7367000 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 1.82e-05, Adjusted R-squared: -0.0003019
## F-statistic: 0.05684 on 1 and 3124 DF, p-value: 0.8116
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -288389946 -1463009 -64586 1597997 65055859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 12176.47 7942.37 1.533 0.125352
## type>500k 29162.62 4682.01 6.229 5.34e-10 ***
## type100k-500k 17769.84 4666.92 3.808 0.000143 ***
## type25k-100k 17521.42 5263.39 3.329 0.000882 ***
## vaxMetric:type<25k 175.75 159.92 1.099 0.271861
## vaxMetric:type>500k -110.85 66.36 -1.670 0.094933 .
## vaxMetric:type100k-500k 60.38 74.96 0.805 0.420630
## vaxMetric:type25k-100k 66.14 99.33 0.666 0.505574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7368000 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.4677, Adjusted R-squared: 0.4663
## F-statistic: 342.4 on 8 and 3118 DF, p-value: < 2.2e-16
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3529509 -12224 -2495 14425 697276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 298.9805 29.5946 10.103 <2e-16 ***
## vaxMetric -4.0106 0.4575 -8.767 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99680 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.02401, Adjusted R-squared: 0.0237
## F-statistic: 76.86 on 1 and 3124 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3480323 -13174 -5106 10332 707402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 130.2314 107.0422 1.217 0.22384
## type>500k 383.7067 63.1011 6.081 1.34e-09 ***
## type100k-500k 99.6779 62.8978 1.585 0.11312
## type25k-100k 207.0272 70.9366 2.918 0.00354 **
## vaxMetric:type<25k -0.3702 2.1553 -0.172 0.86365
## vaxMetric:type>500k -5.4661 0.8943 -6.112 1.11e-09 ***
## vaxMetric:type100k-500k -0.5040 1.0103 -0.499 0.61791
## vaxMetric:type25k-100k -1.8241 1.3388 -1.362 0.17314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99300 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.05224, Adjusted R-squared: 0.04981
## F-statistic: 21.48 on 8 and 3118 DF, p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20220914, ovrWriteError=FALSE)
County-level data are post-processed:
cty_postdata_20220913 <- postProcessCountyData(lstCtyBurden=cty_newdata_20220913$dfPerCapita,
lstCtyVax=cty_vaxdata_20220914$vaxFix,
lstState=readFromRDS("cdc_daily_220902")$dfPerCapita
)
##
## Parameter maxDate is: 2022-09-01
# Save the refreshed file
saveToRDS(cty_postdata_20220913, ovrWriteError=FALSE)
Additional post-processing steps are run:
# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20220913, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).
# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20220913,
p1CompareStates=c("GA", "FL", "NE"),
p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"),
p3VaxTimes=sort(c("2022-01-01", "2022-08-31")),
p4DF=cty_newdata_20220913$dfPerCapita
)
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
Additional plots are also updated:
# Creating the vaccine and burden data
tmpVaxBurden <- createVaxBurdenData(lstVax=cty_vaxdata_20220914, lstBurden=cty_newdata_20220913)
# Nationwide aggregation, excluding problem states
problemStates <- c("VA", "TX", "SD", "HI", "GA", "CO", "GA", "FL", "NE", "VA")
useStates <- state.abb
tmpCountyList <- tmpVaxBurden$ctyPop %>%
filter(state %in% useStates, !(state %in% problemStates)) %>%
left_join(filter(tmpVaxBurden$dfVaxBurden, name=="vxcpoppct", date==max(date)), by="countyFIPS") %>%
mutate(vaxPct=percent_rank(value),
vaxBucket=case_when(vaxPct <= .25 ~ "3. Low", vaxPct >= .75 ~ "1. High", TRUE ~ "2. Medium")
) %>%
split(f=.$vaxBucket)
# Plot of absolute burden
plotVaxBurdenData(tmpVaxBurden,
ctyPlot=lapply(tmpCountyList, FUN=function(x) x %>% rename(bucket=vaxBucket)),
plotTitle="Counties in states with continuous county data"
)
# Plots of burden relative to May 2021
plotVaxBurdenData(tmpVaxBurden,
ctyPlot=lapply(tmpCountyList, FUN=function(x) x %>% rename(bucket=vaxBucket)),
plotTitle="Counties in states with continuous county data",
scaleToDate="2021-05-01"
)
Multiple data anomalies need to be addressed - vaccines data covering only a short time period, counties significantly revising burden downwards, etc.
Counties with significant negative burden are investigated:
keyRatio <- cty_newdata_20220913$dfPerCapita %>%
select(countyFIPS, state, date, cases, deaths) %>%
group_by(countyFIPS, state) %>%
summarize(keyRatDeath=sum(ifelse(date==max(date), deaths, 0)/max(deaths)),
keyRatCases=sum(ifelse(date==max(date), cases, 0)/max(cases)),
.groups="drop"
)
keyRatio %>%
mutate(across(where(is.numeric), .fns=function(x) round(x, 3))) %>%
count(keyRatCases, keyRatDeath) %>%
ggplot(aes(x=keyRatCases, y=keyRatDeath)) +
geom_point(aes(size=n)) +
lims(x=c(0, 1), y=c(0, 1)) +
labs(x="Latest cases vs. maximum cases",
y="Latest deaths vs. maximum deaths",
title="County-level restatement"
)
## Warning: Removed 6 rows containing missing values (`geom_point()`).
keyRatio %>%
mutate(across(where(is.numeric), .fns=function(x) round(x, 3))) %>%
count(keyRatCases, keyRatDeath) %>%
filter(keyRatCases < 1 | keyRatDeath < 1) %>%
ggplot(aes(x=keyRatCases, y=keyRatDeath)) +
geom_point(aes(size=n)) +
lims(x=c(0, 1), y=c(0, 1)) +
labs(x="Latest cases vs. maximum cases",
y="Latest deaths vs. maximum deaths",
title="County-level restatement",
subtitle="Only counties with at least one ratio .999 or lower included"
)
## Warning: Removed 4 rows containing missing values (`geom_point()`).
Restatement of deaths is much more common than restatement of cases. Counties with declines are further explored:
decStatus <- keyRatio %>%
mutate(rd=case_when(is.na(keyRatDeath) ~ -1,
keyRatDeath==1 ~ 1,
keyRatDeath>=.95 ~ .95,
keyRatDeath>=.9 ~ .9,
keyRatDeath>=.75 ~ .75,
keyRatDeath>=.5 ~ .5,
TRUE ~ 0
)
)
decStatus %>%
count(rd)
## # A tibble: 7 × 2
## rd n
## <dbl> <int>
## 1 -1 20
## 2 0 17
## 3 0.5 23
## 4 0.75 61
## 5 0.9 34
## 6 0.95 118
## 7 1 2869
decStatus %>%
count(rd, state) %>%
filter(rd<.95, rd>=0, n>1) %>%
arrange(rd, -n)
## # A tibble: 18 × 3
## rd state n
## <dbl> <chr> <int>
## 1 0 NE 4
## 2 0 AK 3
## 3 0 TX 3
## 4 0 MA 2
## 5 0.5 IL 10
## 6 0.5 NE 9
## 7 0.5 AK 2
## 8 0.75 IL 37
## 9 0.75 MA 11
## 10 0.75 NE 4
## 11 0.75 KS 3
## 12 0.75 VA 2
## 13 0.9 IL 17
## 14 0.9 NE 4
## 15 0.9 CA 2
## 16 0.9 CO 2
## 17 0.9 FL 2
## 18 0.9 VA 2
cty_newdata_20220913$dfPerCapita %>%
left_join(select(decStatus, countyFIPS, state, rd), by=c("countyFIPS", "state")) %>%
group_by(rd, date) %>%
summarize(cases=sum(cases), deaths=sum(deaths), n=n(), .groups="drop") %>%
mutate(labFacet=paste0(rd, " (n=", n, ")")) %>%
ggplot(aes(x=date, y=deaths)) +
geom_line() +
facet_wrap(~labFacet, scales="free_y") +
labs(x=NULL,
y="Reported deaths",
title="Reported deaths, facetted by change from maximum"
)
Illinois and Nebraska appear to be main drivers of reported declines (discontinuities) in deaths. Reporting is potentially lagged, as much of the issue appears to be recent.
Restatement of cases is also explored:
decStatus <- keyRatio %>%
mutate(rd=case_when(is.na(keyRatCases) ~ -1,
keyRatCases==1 ~ 1,
keyRatCases>=.99 ~ .99,
keyRatCases>=.95 ~ .95,
keyRatCases>=.5 ~ .5,
TRUE ~ 0
)
)
decStatus %>%
count(rd)
## # A tibble: 6 × 2
## rd n
## <dbl> <int>
## 1 -1 2
## 2 0 4
## 3 0.5 10
## 4 0.95 16
## 5 0.99 69
## 6 1 3041
decStatus %>%
count(rd, state) %>%
filter(rd<.99, rd>=0, n>1) %>%
arrange(rd, -n)
## # A tibble: 5 × 3
## rd state n
## <dbl> <chr> <int>
## 1 0 TX 3
## 2 0.5 NV 4
## 3 0.5 UT 2
## 4 0.95 NE 6
## 5 0.95 VA 6
cty_newdata_20220913$dfPerCapita %>%
left_join(select(decStatus, countyFIPS, state, rd), by=c("countyFIPS", "state")) %>%
group_by(rd, date) %>%
summarize(cases=sum(cases), deaths=sum(deaths), n=n(), .groups="drop") %>%
mutate(labFacet=paste0(rd, " (n=", n, ")")) %>%
ggplot(aes(x=date, y=cases)) +
geom_line() +
facet_wrap(~labFacet, scales="free_y") +
labs(x=NULL,
y="Reported cases",
title="Reported cases, facetted by change from maximum"
)
In aggregate, the cases look OK, with the exception of a few counties that may have incomplete reporting in the most recent time period. Specific county declines are further explored:
decStatus <- keyRatio %>%
mutate(rd=case_when(is.na(keyRatDeath) ~ -1,
keyRatDeath==1 ~ 1,
keyRatDeath>=.95 ~ .95,
keyRatDeath>=.9 ~ .9,
keyRatDeath>=.75 ~ .75,
keyRatDeath>=.5 ~ .5,
TRUE ~ 0
)
)
decStatus %>%
count(rd)
## # A tibble: 7 × 2
## rd n
## <dbl> <int>
## 1 -1 20
## 2 0 17
## 3 0.5 23
## 4 0.75 61
## 5 0.9 34
## 6 0.95 118
## 7 1 2869
decStatus %>%
count(rd, state) %>%
filter(rd<.95, rd>=0, n>1) %>%
arrange(rd, -n)
## # A tibble: 18 × 3
## rd state n
## <dbl> <chr> <int>
## 1 0 NE 4
## 2 0 AK 3
## 3 0 TX 3
## 4 0 MA 2
## 5 0.5 IL 10
## 6 0.5 NE 9
## 7 0.5 AK 2
## 8 0.75 IL 37
## 9 0.75 MA 11
## 10 0.75 NE 4
## 11 0.75 KS 3
## 12 0.75 VA 2
## 13 0.9 IL 17
## 14 0.9 NE 4
## 15 0.9 CA 2
## 16 0.9 CO 2
## 17 0.9 FL 2
## 18 0.9 VA 2
cty_newdata_20220913$dfPerCapita %>%
left_join(select(decStatus, countyFIPS, state, rd), by=c("countyFIPS", "state")) %>%
filter(rd > 0, rd < 0.9) %>%
ggplot(aes(x=date, y=deaths)) +
geom_line(aes(group=countyFIPS, color=state)) +
facet_wrap(~rd, scales="free_y") +
labs(x=NULL,
y="Reported deaths by county",
title="Reported deaths, facetted by change from maximum"
)
There are two separate issues - some counties appear to have incomplete data in the latest time period, while other counties appear to have significant negative restatement of data earlier in 2022
The ratio process is updated:
keyRatioDate <- cty_newdata_20220913$dfPerCapita %>%
select(countyFIPS, state, date, cases, deaths) %>%
pivot_longer(-c(countyFIPS, state, date)) %>%
arrange(countyFIPS, state, name, date) %>%
group_by(countyFIPS, state, name) %>%
mutate(ratMax=value/max(value, na.rm=TRUE), cumMax=value/cummax(value)) %>%
ungroup()
keyRatioDate
## # A tibble: 6,020,072 × 7
## countyFIPS state date name value ratMax cumMax
## <chr> <chr> <date> <chr> <dbl> <dbl> <dbl>
## 1 01001 AL 2020-01-22 cases 0 0 NaN
## 2 01001 AL 2020-01-23 cases 0 0 NaN
## 3 01001 AL 2020-01-24 cases 0 0 NaN
## 4 01001 AL 2020-01-25 cases 0 0 NaN
## 5 01001 AL 2020-01-26 cases 0 0 NaN
## 6 01001 AL 2020-01-27 cases 0 0 NaN
## 7 01001 AL 2020-01-28 cases 0 0 NaN
## 8 01001 AL 2020-01-29 cases 0 0 NaN
## 9 01001 AL 2020-01-30 cases 0 0 NaN
## 10 01001 AL 2020-01-31 cases 0 0 NaN
## # … with 6,020,062 more rows
dfMinMax <- keyRatioDate %>%
filter(!is.na(cumMax)) %>%
group_by(countyFIPS, name) %>%
summarize(minMax=min(cumMax), .groups="drop")
dfMinMax %>%
ggplot(aes(x=minMax)) +
geom_density(aes(group=name, color=name)) +
labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)",
subtitle="Lowest value per county and metric plotted",
x="Lowest value of daily burden vs. cumulative maximum of burden",
y="Density"
) +
scale_color_discrete("Metric")
dfMinMax %>%
filter(minMax == 0, name=="deaths") %>%
select(countyFIPS) %>%
left_join(cty_newdata_20220913$dfPerCapita, by="countyFIPS") %>%
ggplot(aes(x=date, y=deaths)) +
geom_line(aes(group=countyFIPS)) +
facet_wrap(~countyFIPS, scales="free_y") +
labs(title="Reported deaths by counties with zero following non-zero", x=NULL, y="Reported deaths")
While some of the declines are anomalous, others appear to be curves that are either very low volume or smooth on a rolling-7 basis
The process is repeated to examine issues by state:
keyRatioDateState <- cty_newdata_20220913$dfPerCapita %>%
group_by(state, date) %>%
summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
pivot_longer(-c(state, date)) %>%
arrange(state, name, date) %>%
group_by(state, name) %>%
mutate(ratMax=value/max(value, na.rm=TRUE), cumMax=ifelse(cummax(value)==0, 1, value/cummax(value))) %>%
ungroup()
keyRatioDateState
## # A tibble: 97,716 × 6
## state date name value ratMax cumMax
## <chr> <date> <chr> <dbl> <dbl> <dbl>
## 1 AK 2020-01-22 cases 0 0 1
## 2 AK 2020-01-23 cases 0 0 1
## 3 AK 2020-01-24 cases 0 0 1
## 4 AK 2020-01-25 cases 0 0 1
## 5 AK 2020-01-26 cases 0 0 1
## 6 AK 2020-01-27 cases 0 0 1
## 7 AK 2020-01-28 cases 0 0 1
## 8 AK 2020-01-29 cases 0 0 1
## 9 AK 2020-01-30 cases 0 0 1
## 10 AK 2020-01-31 cases 0 0 1
## # … with 97,706 more rows
dfMinMaxState <- keyRatioDateState %>%
filter(!is.na(cumMax)) %>%
group_by(state, name) %>%
summarize(minMax=min(cumMax), .groups="drop")
dfMinMaxState %>%
ggplot(aes(x=minMax)) +
geom_density(aes(group=name, color=name)) +
labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)",
subtitle="Lowest value per state and metric plotted",
x="Lowest value of daily burden vs. cumulative maximum of burden",
y="Density"
) +
scale_color_discrete("Metric")
dfMinMaxState %>%
filter(minMax < 0.95, name=="deaths") %>%
select(state) %>%
left_join(cty_newdata_20220913$dfPerCapita, by="state") %>%
group_by(state, date) %>%
summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
ggplot(aes(x=date, y=deaths)) +
geom_line(aes(group=state)) +
facet_wrap(~state, scales="free_y") +
labs(title="Reported deaths by states with meaningfully non-ascending trend", x=NULL, y="Reported deaths")
Significant restatements appear to be in MA, while missing recent data appears to be in IL. It is unclear if MO and NE are still reporting county-level deaths. The data is potentially less complete and accurate than in previous iterations
The definition of a decline is modified to be min(cur-cummax)/max, so that declines of 1 or 2 early in the data are not flagged as major percentage changes:
keyRatioDateState_v2 <- cty_newdata_20220913$dfPerCapita %>%
group_by(state, date) %>%
summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
pivot_longer(-c(state, date)) %>%
arrange(state, name, date) %>%
group_by(state, name) %>%
mutate(ratMax=value/max(value, na.rm=TRUE),
cumMax=ifelse(cummax(value)==0, 1, (value-cummax(value))/max(value, na.rm=TRUE))
) %>%
ungroup()
keyRatioDateState_v2
## # A tibble: 97,716 × 6
## state date name value ratMax cumMax
## <chr> <date> <chr> <dbl> <dbl> <dbl>
## 1 AK 2020-01-22 cases 0 0 1
## 2 AK 2020-01-23 cases 0 0 1
## 3 AK 2020-01-24 cases 0 0 1
## 4 AK 2020-01-25 cases 0 0 1
## 5 AK 2020-01-26 cases 0 0 1
## 6 AK 2020-01-27 cases 0 0 1
## 7 AK 2020-01-28 cases 0 0 1
## 8 AK 2020-01-29 cases 0 0 1
## 9 AK 2020-01-30 cases 0 0 1
## 10 AK 2020-01-31 cases 0 0 1
## # … with 97,706 more rows
dfMinMaxState_v2 <- keyRatioDateState_v2 %>%
filter(!is.na(cumMax)) %>%
group_by(state, name) %>%
summarize(minMax=min(cumMax), .groups="drop")
dfMinMaxState_v2 %>%
ggplot(aes(x=fct_reorder(state, -minMax, min), y=1+minMax)) +
geom_col(fill="lightblue") +
geom_text(aes(label=round(1+minMax, 2)), hjust=0) +
labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)",
subtitle="Lowest value of 1 + (value - cummax(value)) / max(value)",
y="Lowest value",
x=NULL
) +
coord_flip() +
facet_wrap(~name)
dfMinMaxState_v2 %>%
filter(state %in% c("IL", "TX", "MA", "NE")) %>%
select(state) %>%
left_join(cty_newdata_20220913$dfPerCapita, by="state") %>%
group_by(state, date) %>%
summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
ggplot(aes(x=date, y=deaths)) +
geom_line(aes(group=state)) +
facet_wrap(~state, scales="free_y") +
labs(title="Reported deaths by states with meaningfully non-ascending trend", x=NULL, y="Reported deaths")
dfMinMaxState_v2 %>%
filter(state %in% c("IL", "WY")) %>%
select(state) %>%
left_join(cty_newdata_20220913$dfPerCapita, by="state") %>%
group_by(state, date) %>%
summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
ggplot(aes(x=date, y=cases)) +
geom_line(aes(group=state)) +
facet_wrap(~state, scales="free_y") +
labs(title="Reported cases by states with meaningfully non-ascending trend", x=NULL, y="Reported cases")
The updated methodology better flags states with meaningful restatement problems
The process is converted to functional form:
# Function to calculate distance from global maxima and previous maxima (including self)
findDeltaFromMax <- function(df, groupBy=c(), timeVar="date", numVars=NULL) {
# FUNCTION ARGUMENTS:
# df: a data frame
# groupBy: levels to which the final data should be aggregated
# timeVar: time variable to which data should be aggregated
# numVars: numeric variables to be summarized (NULL means all numeric variables)
# Find numVars if not provided
if(is.null(numVars)) numVars <- df %>% select(where(is.numeric)) %>% names %>% setdiff(groupBy)
df %>%
group_by_at(all_of(c(groupBy, timeVar))) %>%
summarize(across(all_of(numVars), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
pivot_longer(all_of(numVars)) %>%
arrange(across(all_of(c(groupBy, "name", timeVar)))) %>%
group_by_at(all_of(c(groupBy, "name"))) %>%
mutate(ratMax=value/max(value, na.rm=TRUE),
cumMax=ifelse(cummax(value)==0, 1, value/cummax(value)),
delMax=ifelse(cummax(value)==0, 1, (value-cummax(value))/max(value, na.rm=TRUE))
) %>%
ungroup()
}
# Check for states
dfTest <- findDeltaFromMax(cty_newdata_20220913$dfPerCapita, groupBy="state", numVar=c("cases", "deaths"))
## Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
## 1.2.0.
## ℹ See details at
## <https://tidyselect.r-lib.org/reference/faq-selection-context.html>
identical(dfTest %>% select(-delMax), keyRatioDateState)
## [1] TRUE
identical(dfTest %>% select(-cumMax) %>% rename(cumMax=delMax), keyRatioDateState_v2)
## [1] TRUE
# Check for counties - old approach output NaN rather than 1 when cummax(value)=0
dfTest_v2 <- findDeltaFromMax(cty_newdata_20220913$dfPerCapita,
groupBy=c("countyFIPS", "state"),
numVar=c("cases", "deaths")
)
identical(dfTest_v2 %>% select(-delMax, -cumMax), keyRatioDate %>% select(-cumMax))
## [1] TRUE
all.equal(dfTest_v2 %>% pull(cumMax),
keyRatioDate %>% select(cumMax) %>% mutate(cumMax=ifelse(is.na(cumMax), 1, cumMax)) %>% pull(cumMax)
)
## [1] TRUE
Functions are also written for plot creation:
plotDeltaFromMax <- function(df,
dfCtyData=NULL,
plotStateRatio=FALSE,
plotDeathStates=c(),
plotCaseStates=c()
) {
# FUNCTION ARGUMENTS:
# df: data frame or tibble containing ratios by entity and date
# dfCtyData: county-level per-capita data (not needed if only plotStateRatio is run)
# plotStateRatio: should the state ratio plot be created?
# plotDeathStates: vector of states to plot on the deaths metric (c() means do not plot)
# plotCaseStates: vector of states to plot on the cases metric (c() means do not plot)
# Plot 1: state ratios
if(isTRUE(plotStateRatio)) {
p1 <- df %>%
filter(!is.na(delMax)) %>%
group_by(state, name) %>%
summarize(minMax=min(delMax), .groups="drop") %>%
ggplot(aes(x=fct_reorder(state, -minMax, min), y=1+minMax)) +
geom_col(fill="lightblue") +
geom_text(aes(label=round(1+minMax, 2)), hjust=0) +
labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)",
subtitle="Lowest value of 1 + (value - cummax(value)) / max(value)",
y="Lowest value",
x=NULL
) +
coord_flip() +
facet_wrap(~name)
print(p1)
}
# Plots 2 and 3: Create case and death data
if((length(plotCaseStates) > 0) | (length(plotDeathStates) > 0)) {
dfPlot <- dfCtyData %>%
select(state, date, cases, deaths) %>%
filter(state %in% all_of(union(plotCaseStates, plotDeathStates))) %>%
group_by(state, date) %>%
summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop")
}
# Plot 2: Death states
if(length(plotDeathStates) > 0) {
p2 <- dfPlot %>%
filter(state %in% all_of(plotDeathStates)) %>%
ggplot(aes(x=date, y=deaths)) +
geom_line(aes(group=state)) +
facet_wrap(~state, scales="free_y") +
labs(title="Reported deaths by states with meaningfully non-ascending trend",
x=NULL,
y="Reported deaths"
)
print(p2)
}
# Plot 3: Case states
if(length(plotCaseStates) > 0) {
p3 <- dfPlot %>%
filter(state %in% all_of(plotCaseStates)) %>%
ggplot(aes(x=date, y=cases)) +
geom_line(aes(group=state)) +
facet_wrap(~state, scales="free_y") +
labs(title="Reported cases by states with meaningfully non-ascending trend",
x=NULL,
y="Reported cases"
)
print(p3)
}
}
plotDeltaFromMax(dfTest, plotStateRatio=TRUE)
plotDeltaFromMax(dfTest,
dfCtyData=cty_newdata_20220913$dfPerCapita,
plotDeathStates=c("IL", "TX", "MA", "NE"),
plotCaseStates=c("IL", "WY")
)
Missing vaccines data are also explored:
tmpVaxCounts <- readFromRDS("cty_vaxdata_20220709")$vaxFix %>%
count(date, name="vax0709") %>%
full_join(cty_vaxdata_20220914$vaxFix %>%
count(date, name="vax0914"),
by="date"
)
tmpVaxCounts %>%
pivot_longer(-c(date)) %>%
mutate(value=ifelse(is.na(value), 0, value)) %>%
ggplot(aes(x=date, y=value)) +
geom_line(aes(group=name, color=name)) +
labs(title="# Counties reporting vaccines by date", x=NULL, y="# Counties reporting") +
scale_color_discrete("Source Date")
There is only a small overlapping section of data. Older data will need to be merged for a complete dataset. Consistency of reported vaccines is also checked:
tempGetVax <- function(df, groupBy=c("date")) {
df %>%
group_by_at(all_of(groupBy)) %>%
summarize(vxc=sum(vxcpoppct*pop/100, na.rm=TRUE),
vxcgte18=sum(vxcgte18pct*popgte18/100, na.rm=TRUE),
vxcgte65=sum(vxcgte65pct*popgte65/100, na.rm=TRUE),
.groups="drop"
)
}
tmpVaxSum <- tempGetVax(cty_vaxdata_20220914$vaxFix) %>%
bind_rows(tempGetVax(readFromRDS("cty_vaxdata_20220709")$vaxFix), .id="src")
tmpVaxSum %>%
mutate(src=c("1"="14-SEP-22", "2"="09-JUL-22")[src]) %>%
pivot_longer(-c(src, date)) %>%
ggplot(aes(x=date, y=value)) +
geom_line(aes(group=src, color=src)) +
facet_wrap(~name) +
labs(title="Vaccinated by source, date, and age bucket", x=NULL, y="# Fully vaccinated") +
scale_color_discrete("Data From:")
Data volumes appear to be broadly consistent, further suggestive that pasting back missing historical data is reasonable.
The latest county-level burden data are downloaded:
readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221010.csv",
"usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20221010.csv"
)
compareList <- list("usafCase"=readFromRDS("cty_newdata_20220913")$dfRaw$usafCase,
"usafDeath"=readFromRDS("cty_newdata_20220913")$dfRaw$usafDeath
)
# Use existing clusters
cty_newdata_20221010 <- readRunUSAFacts(maxDate="2022-10-08",
downloadTo=lapply(readList,
FUN=function(x) if(file.exists(x)) NA else x
),
readFrom=readList,
compareFile=compareList,
writeLog="./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log",
ovrwriteLog=TRUE,
useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
skipAssessmentPlots=FALSE,
brewPalette="Paired"
)
## Rows: 3193 Columns: 992
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (989): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 30
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 184 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
## Rows: 3193 Columns: 992
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (989): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 30
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 66 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType cases new_cases n
## <chr> <dbl> <dbl> <dbl>
## 1 before 3.73e+10 92832282 3154684
## 2 after 3.70e+10 90648951 3104296
## 3 pctchg 8.36e- 3 0.0235 0.0160
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType deaths new_deaths n
## <chr> <dbl> <dbl> <dbl>
## 1 before 5.43e+8 1046659 3154684
## 2 after 5.24e+8 972745 3104296
## 3 pctchg 3.44e-2 0.0706 0.0160
## NULL
# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20221010$useClusters,
caption="Includes only counties with 25k+ population",
brewPalette="viridis"
)
# Save the refreshed file
saveToRDS(cty_newdata_20221010, ovrWriteError=FALSE)
Vaccines data are also updated, though the process will need to integrate previous data:
cty_vaxdata_20221011 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20221011.csv",
ctyList=cty_newdata_20221010,
minDateCD=c("2022-06-09", "2022-06-09"),
maxDateCD="2022-09-29"
)
## Rows: 446971 Columns: 72
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (66): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
## state n
## <chr> <int>
## 1 AS 136
## 2 FM 136
## 3 GU 272
## 4 MH 136
## 5 MP 136
## 6 PR 10756
## 7 PW 136
## 8 VI 545
## 9 <NA> 79
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
##
## Count of NA records by column
## state FIPS popgte65_minpop popgte65_maxpop popgte65_nnA
## 0 0 0 0 0
## n
## 0
##
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## # maxpop <dbl>
## # ℹ Use `colnames()` to see all variable names
##
##
##
## Will run with parameters:
## burdenVar: cpm dpm
## vaxVar: vxcpoppct vxcpoppct
## minDateCD: 2022-06-09 2022-06-09
## maxDateCD: 2022-09-29 2022-09-29
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -292485037 -1737398 -20921 1895138 67166643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24796.753 2297.542 10.793 <2e-16 ***
## vaxMetric 9.448 35.514 0.266 0.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7738000 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 2.266e-05, Adjusted R-squared: -0.0002974
## F-statistic: 0.07078 on 1 and 3124 DF, p-value: 0.7902
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -291660235 -1751594 -20106 1897925 69792822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 15606.10 8344.59 1.870 0.0615 .
## type>500k 31568.15 4919.11 6.417 1.60e-10 ***
## type100k-500k 21288.00 4903.26 4.342 1.46e-05 ***
## type25k-100k 21739.66 5529.94 3.931 8.63e-05 ***
## vaxMetric:type<25k 194.27 168.02 1.156 0.2477
## vaxMetric:type>500k -88.16 69.72 -1.264 0.2062
## vaxMetric:type100k-500k 67.89 78.76 0.862 0.3888
## vaxMetric:type25k-100k 70.74 104.37 0.678 0.4979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7741000 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.5302, Adjusted R-squared: 0.529
## F-statistic: 439.8 on 8 and 3118 DF, p-value: < 2.2e-16
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3549904 -14579 -2095 18933 693414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 352.2786 30.1110 11.70 <2e-16 ***
## vaxMetric -4.4866 0.4654 -9.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101400 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.02888, Adjusted R-squared: 0.02857
## F-statistic: 92.92 on 1 and 3124 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3500417 -16774 -5431 14429 705700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 178.8397 108.8961 1.642 0.100629
## type>500k 413.3400 64.1940 6.439 1.39e-10 ***
## type100k-500k 147.3023 63.9871 2.302 0.021397 *
## type25k-100k 254.8892 72.1652 3.532 0.000418 ***
## vaxMetric:type<25k -0.4989 2.1926 -0.228 0.820011
## vaxMetric:type>500k -5.6202 0.9098 -6.177 7.36e-10 ***
## vaxMetric:type100k-500k -0.9261 1.0278 -0.901 0.367661
## vaxMetric:type25k-100k -2.1172 1.3620 -1.555 0.120155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101000 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.07933, Adjusted R-squared: 0.07696
## F-statistic: 33.58 on 8 and 3118 DF, p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20221011, ovrWriteError=FALSE)
County-level data are post-processed:
cty_postdata_20221010 <- postProcessCountyData(lstCtyBurden=cty_newdata_20221010$dfPerCapita,
lstCtyVax=cty_vaxdata_20221011$vaxFix,
lstState=readFromRDS("cdc_daily_221002")$dfPerCapita
)
##
## Parameter maxDate is: 2022-10-01
# Save the refreshed file
saveToRDS(cty_postdata_20221010, ovrWriteError=FALSE)
The Alaska portion of usmap::plot_usmap() has an issue with Valdex county. The function is updated to allow for excluding problematic states:
# Updated to allow for include and exclude
makeBurdenDatePlot <- function(df,
keyVar,
timeLabel,
plotTitle=NULL,
varLabel=NULL,
varFloor=0,
varCeiling=+Inf,
varDivBy=1,
vecRename=c("countyFIPS"="fips"),
includeStates=c(),
excludeStates=c()
) {
# FUNCTION ARGUMENTS:
# df: a processed data frame with fips, asofDate, burden
# keyVar: character string for variable to be plotted
# timeLabel: character string for amount of time (e.g., "1-month" or "5-week")
# plotTitle: title for the plot (NULL means infer from other arguments)
# varLabel: label for the variable in plot scale (NULL means infer from other arguments)
# varFloor: minimum value to be allowed for variable (-Inf means no floor applied)
# varCeiling: maximum value to be allowed for variable (Inf means no ceiling applied)
# varDivBy: variable should be divivded by this for plotting
# vecRename: renaming vector to get desired variables in frame
# includeStates: specific states to be plotted, where c() means all states
# excludeStates: specific states NOT to be plotted (ignored if includeStates has length > 0)
# Create varLabel if passed as NULL
if(is.null(varLabel)) {
varLabel <- stringr::str_to_upper(stringr::str_extract(keyVar, "^[A-Za-z]*"))
if((varDivBy > 1) & isTRUE(all.equal(log10(varDivBy) %% 1, 0)))
varLabel <- paste0(varLabel, "(", stringr::str_replace(varDivBy, pattern="1", replacement=""), "s)")
else if (varDivBy != 1) varLabel <- paste0(varLabel, "(units of ", varDivBy, ")")
}
# Create plotTitle if passed as NULL
if(is.null(plotTitle))
plotTitle <- paste0(timeLabel,
" coronavirus ",
if(str_detect(stringr::str_to_upper(keyVar), pattern="CPM")) "cases" else "deaths",
" by county"
)
# Resolve includeStates and excludeStates
if(length(includeStates) > 0) {
if(length(excludeStates) > 0) {
cat("\nParamater excludeStates will be ignored since includeStates was passed\n")
excludeStates <- c()
}
invalidInclude <- setdiff(includeStates, c(state.abb, "DC"))
if(length(invalidInclude) > 0) {
cat("\nInvalid states passed in includeStates deleted:", paste(invalidInclude, collapse=", "), "\n")
includeStates <-setdiff(includeStates, invalidInclude)
}
}
# Create and return plot
p1 <- df %>%
colRenamer(vecRename=vecRename) %>%
mutate(burden=pmax(pmin(get(all_of(keyVar)), varCeiling), varFloor)/varDivBy) %>%
select(fips, burden, asofDate) %>%
usmap::plot_usmap(regions="counties", data=., values="burden", include=includeStates, exclude=excludeStates) +
labs(title=plotTitle,
subtitle=if(varFloor > -Inf | varCeiling < +Inf) "Floors and/or ceilings applied" else NULL,
caption="Source: USA Facts"
) +
scale_fill_continuous(paste0(varLabel, "\n", timeLabel), low="grey", high="red") +
facet_wrap(~asofDate) +
theme(legend.position="bottom")
p1
}
postProcessCountyData <- function(lstCtyBurden,
lstCtyVax,
lstState,
maxDate=NULL,
minDateBurden="2020-02-15",
minDateVax="2021-04-01",
includeStates=c(),
excludeStates=c()
) {
# FUNCTION ARGUMENTS:
# lstCtyBurden: list of processed county-level burden data (or a dfPerCapita file from this list)
# lstCtyVax: list of processed county-level vaccines data (or a vaxFix file from this list)
# lstState: list of processed state-level burden data (or a dfPerCapita file from this list)
# maxDate: maximum date to use for plotting (NULL means latest date in both lstCty and lstState)
# minDateBurden: earliest date for scoring burden similarity across files
# minDateVax: earliest date for scoring vaccine similarity across files
# includeStates: specific states to be plotted, where c() means all states
# excludeStates: specific states NOT to be plotted (ignored if includeStates has length > 0)
# Extract the relevant perCapita and vaxFix data if needed
if("list" %in% class(lstCtyBurden)) lstCtyBurden <- lstCtyBurden[["dfPerCapita"]]
if("list" %in% class(lstState)) lstState <- lstState[["dfPerCapita"]]
if("list" %in% class(lstCtyVax)) lstCtyVax <- lstCtyVax[["vaxFix"]]
# Get maxDate if not provided
if(is.null(maxDate)) maxDate <- min(max(lstCtyBurden$date, na.rm=TRUE), max(lstState$date, na.rm=TRUE))
cat("\nParameter maxDate is:", as.character(maxDate), "\n\n")
# Data for score similarity process
dfCompare <- compareStateSummedCounty(dfState=lstState,
dfCounty=lstCtyBurden,
inclStates=c(state.abb, "DC"),
dateThru=maxDate,
makePlot=FALSE,
returnData=TRUE
)
scoreSimilarity(dfCompare, minDate=minDateBurden, maxDate=maxDate, makeFacet=FALSE)
# Check differences in data sources
dfAllState <- integrateStateVaccine(lstCtyVax, statePerCap=lstState, treatNAZero=TRUE)
vaxDiff <- scoreVaxSimilarity(dfAllState, minDate=minDateVax, maxDate=maxDate, returnBaseData=TRUE)
# Create county-level burden data by quarters
dfRoll91 <- createBurdenCountyDate(lstCtyBurden,
maxDate=maxDate,
rollBy=months(c(0, -3, -6, -9)),
dateSpan=91
)
makeBurdenDatePlot(dfRoll91,
keyVar="cpm91",
timeLabel="3-month",
varCeiling=100000,
varDivBy=1000,
includeStates=includeStates,
excludeStates=excludeStates
) %>%
print()
makeBurdenDatePlot(dfRoll91,
keyVar="dpm91",
timeLabel="3-month",
varCeiling=1500,
includeStates=includeStates,
excludeStates=excludeStates
) %>%
print()
# Return the key elements
list(dfCompare=dfCompare, dfAllState=dfAllState, vaxDiff=vaxDiff, dfRoll91=dfRoll91)
}
County-level data are post-processed:
cty_postdata_20221010_v2 <- postProcessCountyData(lstCtyBurden=cty_newdata_20221010$dfPerCapita,
lstCtyVax=cty_vaxdata_20221011$vaxFix,
lstState=readFromRDS("cdc_daily_221002")$dfPerCapita,
excludeStates="AK"
)
##
## Parameter maxDate is: 2022-10-01
# Check equivalence of data files
identical(cty_postdata_20221010, cty_postdata_20221010_v2)
## [1] TRUE
Additional post-processing steps are run:
# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20221010, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).
# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20221010,
p1CompareStates=c("GA", "FL", "NE"),
p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"),
p3VaxTimes=sort(c("2022-01-01", "2022-09-28")),
p4DF=cty_newdata_20221010$dfPerCapita
)
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
The function is updated to allow for excluding states (such as AK in this case where Valdez County issues in usmap cause an NA issue):
additionalCountyPostProcess <- function(lstPost,
p1CompareStates=c(),
p1AggData=FALSE,
p2VaxStates=c(),
p3VaxTimes=c(),
p4DF=NULL,
p4MaxDate=NULL,
p4RollBy=months(c(0, -1, -2, -3)),
p4DateSpan=28,
p4CPMCeiling=50000,
p4DPMCeiling=500,
includeStates=c(),
excludeStates=c()
) {
# FUNCTION ARGUMENTS:
# lstPost: list of post-processed county data
# p1CompareStates: states that should be compared vs. summed county
# p1AggData: boolean, should the comparison states all be aggregated to a single comparison?
# p2VaxStates: states that should be compared for vaccine evolution
# p3VaxTimes: character vector of form c(minDate, maxDate) for time period to score vaccine similarity
# p4DF: data frame for creating rolling data (NULL means do not run)
# p4MaxDate: maximum date for rolling analysis (NULL means use maximum date in p4DF minus 1 day)
# p4RollBy: time periods to roll back for analysis
# p4DateSpan: size of windows for rolling analysis
# p4CPMCeiling: ceiling for plots on CPM (all values at or above this will be the same color)
# p4DPMCeiling: ceiling for plots on DPM (all values at or above this will be the same color)
# includeStates: specific states to be plotted, where c() means all states
# excludeStates: specific states NOT to be plotted (ignored if includeStates has length > 0)
# 1. Plotting state vs. summed county for key states
if(length(p1CompareStates) > 0) {
compareStateSummedCounty(lstAll=lstPost[["dfCompare"]],
inclStates=p1CompareStates,
createData=FALSE,
aggData=p1AggData
)
}
# 2. Plot differences in vaccines data if needed
if(length(p2VaxStates) > 0)
stateAgeVaxEvolution(lstPost[["dfAllState"]], keyState=p2VaxStates)
# 3. Check vaccine similarity scoring on a different time period
if(length(p3VaxTimes) > 0) {
if(length(p3VaxTimes) != 2 | p3VaxTimes[2] < p3VaxTimes[1])
cat("\np3VaxTimes should be c(minDate, maxDate), skipping this step due to bad parameter\n")
else
scoreVaxSimilarity(lstPost[["dfAllState"]], minDate=p3VaxTimes[1], maxDate=p3VaxTimes[2])
}
# 4. Additional rolling data as needed
if(!is.null(p4DF)) {
if(is.null(p4MaxDate)) p4MaxDate <- max(p4DF$date) - lubridate::days(1)
dfRoll <- createBurdenCountyDate(p4DF, maxDate=p4MaxDate, rollBy=p4RollBy, dateSpan=p4DateSpan)
makeBurdenDatePlot(dfRoll,
keyVar=paste0("cpm", p4DateSpan),
timeLabel=paste0(p4DateSpan, "-day"),
varCeiling=p4CPMCeiling,
varDivBy=1000,
includeStates=includeStates,
excludeStates=excludeStates
) %>%
print()
makeBurdenDatePlot(dfRoll,
keyVar=paste0("dpm", p4DateSpan),
timeLabel=paste0(p4DateSpan, "-day"),
varCeiling=p4DPMCeiling,
includeStates=includeStates,
excludeStates=excludeStates
) %>%
print()
}
}
Additional post-processing steps are re-run:
# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20221010, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).
# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20221010,
p1CompareStates=c("GA", "FL", "NE"),
p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"),
p3VaxTimes=sort(c("2022-01-01", "2022-09-28")),
p4DF=cty_newdata_20221010$dfPerCapita,
excludeStates=c("AK")
)
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
Restatement issues are explore:
# Check for states
dfTest_20221010_state <- findDeltaFromMax(cty_newdata_20221010$dfPerCapita,
groupBy="state",
numVar=c("cases", "deaths")
)
# Check for counties
dfTest_20221010_cty <- findDeltaFromMax(cty_newdata_20221010$dfPerCapita,
groupBy=c("countyFIPS", "state"),
numVar=c("cases", "deaths")
)
# Explore state differences
plotDeltaFromMax(dfTest_20221010_state, plotStateRatio=TRUE)
plotDeltaFromMax(dfTest_20221010_state,
dfCtyData=cty_newdata_20221010$dfPerCapita,
plotDeathStates=c("IL", "MA", "NE"),
plotCaseStates=c("IL", "WY")
)
Counties are also explored for restatement, using a combination of value and ratio:
dfRatio_cty <- dfTest_20221010_cty %>%
group_by(countyFIPS, state, name) %>%
filter(delMax==min(delMax)) %>%
filter(date==max(date)) %>%
ungroup()
dfRatio_cty
## # A tibble: 6,284 × 8
## countyFIPS state date name value ratMax cumMax delMax
## <chr> <chr> <date> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 AL 2022-04-19 cases 15755 0.855 0.999 -0.000869
## 2 01001 AL 2022-03-27 deaths 209 0.917 0.995 -0.00439
## 3 01003 AL 2022-04-19 cases 55564 0.845 1.00 -0.000228
## 4 01003 AL 2021-04-14 deaths 300 0.420 0.997 -0.00140
## 5 01005 AL 2021-06-27 cases 2344 0.339 0.999 -0.000289
## 6 01005 AL 2021-05-19 deaths 56 0.544 0.982 -0.00971
## 7 01007 AL 2021-04-14 cases 2559 0.340 0.998 -0.000663
## 8 01007 AL 2021-04-12 deaths 58 0.537 0.967 -0.0185
## 9 01009 AL 2021-11-11 cases 10494 0.616 0.995 -0.00317
## 10 01009 AL 2021-04-22 deaths 133 0.516 0.985 -0.00775
## # … with 6,274 more rows
# Explore ratios of deaths by county - IL/MA/NE vs. others
dfRatio_cty %>%
filter(name=="deaths", value > 0) %>%
mutate(type=case_when(state=="IL" ~ "IL", state %in% c("MA", "NE") ~ "MA/NE", state=="WY" ~ "WY", TRUE ~ "Other"),
rat=1+delMax
) %>%
ggplot(aes(x=rat)) +
geom_histogram(aes(fill=type)) +
facet_wrap(~type, nrow=2, scales="free_y") +
labs(x="Lowest ratio of (deaths-cummax(deaths))/max(deaths) by county",
y="# Counties",
title="Death ratios by county (1.0 means strictly non-descending, 0.0 means reports 0 after max)"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Explore ratios of cases by county - IL/WY vs. others
dfRatio_cty %>%
filter(name=="cases", value > 0) %>%
mutate(type=case_when(state=="IL" ~ "IL", state %in% c("MA", "NE") ~ "MA/NE", state=="WY" ~ "WY", TRUE ~ "Other"),
rat=1+delMax
) %>%
ggplot(aes(x=rat)) +
geom_histogram(aes(fill=type)) +
facet_wrap(~type, nrow=2, scales="free_y") +
labs(x="Lowest ratio of (cases-cummax(cases))/max(cases) by county",
y="# Counties",
title="Case ratios by county (1.0 means strictly non-descending, 0.0 means reports 0 after max)"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Case data generally has more restatement than deaths data. IL, MA, and NE all have significant county-level changes for deaths, while WY has the most counties reporting large changes in cases
A sample of counties with high death restatements are plotted:
dfRatio_cty %>%
filter(name=="deaths", delMax < -0.01, delMax*value < -100) %>%
select(countyFIPS) %>%
left_join(cty_newdata_20221010$dfPerCapita, by="countyFIPS") %>%
ggplot(aes(x=date, y=deaths)) +
geom_line() +
facet_wrap(~paste0(countyFIPS, " (", state, ")"), scales="free_y")
The issue in IL is driven almost exclusively by 17031, which restated away around 50% of deaths. The issue in MA appears to be consistent across counties, with the restated data somewhat close to on trend prior to the spike. Two counties in TX and NJ have an erroneous spike, while one county in TX (48141) restated away almost all deaths
A sample of counties with high case restatements are plotted:
dfRatio_cty %>%
filter(name=="cases", delMax < -0.01, delMax*value < -2000) %>%
select(countyFIPS) %>%
left_join(cty_newdata_20221010$dfPerCapita, by="countyFIPS") %>%
ggplot(aes(x=date, y=cases)) +
geom_line() +
facet_wrap(~paste0(countyFIPS, " (", state, ")"), scales="free_y")
Many counties with high case restatements appear to mainly follow trend at a glance. Significant off-trend deviations occur in 17031 (IL), 25007 (MA), 48139 (TX), 48141 (TX), 48189 (TX), and 48279 (TX). The deviations in Wyoming appear to be one-time issues.
Data with and without anomalous counties can then be plotted:
anom_cty <- dfRatio_cty %>%
filter(name=="cases") %>%
mutate(bigPct=(delMax < -0.05), bigVol=(delMax*value < -2000))
anom_cty %>%
count(bigPct, bigVol)
## # A tibble: 4 × 3
## bigPct bigVol n
## <lgl> <lgl> <int>
## 1 FALSE FALSE 3011
## 2 FALSE TRUE 6
## 3 TRUE FALSE 113
## 4 TRUE TRUE 12
anom_cty %>%
select(countyFIPS, state, bigPct, bigVol) %>%
inner_join(cty_newdata_20221010$dfPerCapita, by=c("state", "countyFIPS")) %>%
mutate(type=case_when(bigVol ~ "2) Big volume change",
bigPct ~ "3) Big percent change",
TRUE ~ "1) Low/no percent change"
)
) %>%
group_by(type, date) %>%
summarize(across(c(cases, deaths), sum), .groups="drop") %>%
ggplot(aes(x=date, y=cases)) +
geom_line() +
facet_wrap(~type, scales="free_y") +
labs(title="Sum of cases by county type",
x=NULL,
y="Sum of cases",
subtitle="Big volume change is at least 2000 cases, big percentage change is at least 5% (but not 2000 cases)"
)
Case data appear reasonable after anomalies are removed. The large percentage changes make some impact but appear to generally average out and remain on trend. The large volume changes appear to be anomalies that impact trends, especially in the most recent time periods.
The process is converted to functional form:
plotByRestatement <- function(dfRatio,
dfBurden,
idVars=c("countyFIPS", "state"),
keyMetric="cases",
keyMetricBurden=keyMetric,
delMaxHurdle=0,
bigVolHurdle=0,
returnData=FALSE
) {
# FUNCTION ARGUMENTS:
# dfRatio: file containing key metrics for determining restatement status
# dfBurden: the burden data by geography
# idVars: variables that identify a geographical unit
# keyMetric: name of variable in the ratio file
# keyMetricBurden: name of variable in the burden file
# delMaxHurdle: cutoff for determining large delMax
# bigVolHurdle: cutoff for determining large delMax*value
# returnData: boolean, should dfSeg be returned?
# Create segment data
dfSeg <- dfRatio %>%
filter(name==keyMetric) %>%
mutate(bigPct=(delMax < delMaxHurdle), bigVol=(delMax*value < bigVolHurdle))
# Report on segment data
dfSeg %>%
count(bigPct, bigVol) %>%
print()
# Create and print plot
p1 <- dfSeg %>%
select(c(all_of(idVars), "bigPct", "bigVol")) %>%
inner_join(dfBurden, by=all_of(idVars)) %>%
mutate(type=case_when(bigVol ~ "2) Big volume change",
bigPct ~ "3) Big percent change",
TRUE ~ "1) Low/no percent change"
)
) %>%
group_by(type, date) %>%
summarize(across(all_of(keyMetricBurden), sum), .groups="drop") %>%
ggplot(aes_string(x="date", y=keyMetricBurden[1])) +
geom_line() +
facet_wrap(~type, scales="free_y") +
labs(title="Sum of cases by segment",
x=NULL,
y="Sum of cases",
subtitle=paste0("Big volume change is more than ",
-bigVolHurdle,
" ",
keyMetricBurden[1],
", big percentage change is at least ",
round(-100*delMaxHurdle),
"% (but not ",
-bigVolHurdle,
" ",
keyMetricBurden[1],
")"
)
)
print(p1)
# Return data if requested
if(isTRUE(returnData)) return(dfSeg)
}
dfSegTest <- plotByRestatement(dfRatio_cty,
dfBurden=cty_newdata_20221010$dfPerCapita,
keyMetric="cases",
delMaxHurdle=-0.05,
bigVolHurdle=-2000,
returnData=TRUE
)
## # A tibble: 4 × 3
## bigPct bigVol n
## <lgl> <lgl> <int>
## 1 FALSE FALSE 3011
## 2 FALSE TRUE 6
## 3 TRUE FALSE 113
## 4 TRUE TRUE 12
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation ideoms with `aes()`
identical(anom_cty, dfSegTest)
## [1] TRUE
A function is also written to create the main restatement dataset:
createRestatementData <- function(df,
geoType="county",
idVars=NULL,
numVars=c("cases", "deaths"),
varPeak="delMax",
fnPeak=min,
makeIntermediate=TRUE,
returnIntermediate=FALSE
) {
# FUNCTION ARGUMENTS:
# df: data frame containing relevant per-capita data OR previously made dfIntermediate
# geoType: the type of data (only "county" and "state" are supported)
# idVars: the ID variables for a geographical unit in df (NULL means infer from geoType)
# numVars: numeric variables for calculating restatement amounts
# varPeak: variable to be used for calculating whether restatement occurred
# fnPeak: function to be applied to determine where 'most' restatement occurred
# makeIntermediate: boolean, should dfIntermediate be created? Otherwise, df is treated as dfIntermediate
# returnIntermediate: boolean, should the process be stopped and just dfIntermediate returned?
# Infer idVars if passed as NULL
if(is.null(idVars)) {
if(geoType=="county") idVars <- c("countyFIPS", "state")
else if(geoType=="state") idVars <- c("state")
else error("\nFunction can only infer for geoType of 'county' or 'state'\n")
}
# Create intermediate data frame using findDeltaFromMax() OR as passed
if(isTRUE(makeIntermediate)) df <- findDeltaFromMax(df, groupBy=idVars, numVar=numVars)
# If returnIntermediate is TRUE, return that frame and stop the function
if(isTRUE(returnIntermediate)) return(df)
# Create and return the ratio data otherwise
df %>%
group_by_at(all_of(c(idVars, "name"))) %>%
filter(get(varPeak)==fnPeak(get(varPeak))) %>%
filter(date==max(date)) %>%
ungroup()
}
# Check data creation process for counties
dfInter_cty <- createRestatementData(cty_newdata_20221010$dfPerCapita,
geoType="county",
makeIntermediate=TRUE,
returnIntermediate=TRUE
)
identical(dfInter_cty, dfTest_20221010_cty)
## [1] TRUE
# Check data creation process for states
dfInter_state <- createRestatementData(cty_newdata_20221010$dfPerCapita,
geoType="state",
makeIntermediate=TRUE,
returnIntermediate=TRUE
)
identical(dfInter_state, dfTest_20221010_state)
## [1] TRUE
dfRatio_cty_chk <- createRestatementData(dfInter_cty,
geoType="county",
makeIntermediate=FALSE,
returnIntermediate=FALSE
)
identical(dfRatio_cty_chk, dfRatio_cty)
## [1] TRUE
dfRatio_state <- createRestatementData(dfInter_state,
geoType="state",
makeIntermediate=FALSE,
returnIntermediate=FALSE
)
dfRatio_state
## # A tibble: 102 × 7
## state date name value ratMax cumMax delMax
## <chr> <date> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AK 2021-06-14 cases 68383 0.251 0.982 -0.00448
## 2 AK 2021-08-21 deaths 382 0.292 0.972 -0.00842
## 3 AL 2022-04-19 cases 1298473 0.851 1.00 -0.000239
## 4 AL 2021-04-09 deaths 10686 0.522 1.00 -0.000244
## 5 AR 2022-10-06 cases 921985 1 1 0
## 6 AR 2021-04-01 deaths 5636 0.460 0.994 -0.00269
## 7 AZ 2022-10-06 cases 2275235 1 1 0
## 8 AZ 2022-03-08 deaths 27708 0.928 0.991 -0.00797
## 9 CA 2020-02-02 cases 27 0.00000259 0.0327 -0.0000767
## 10 CA 2021-12-27 deaths 75398 0.794 1.00 -0.000347
## # … with 92 more rows
Ratio data can then be plotted:
dfRatio_state %>%
ggplot(aes(x=fct_reorder(state, delMax), y=delMax)) +
geom_col() +
geom_text(aes(label=paste0(round(100*delMax, 0), "%")), hjust=1, size=3) +
coord_flip() +
facet_wrap(~name) +
labs(title="Maximum downward restatement from previous peak vs. maximum value for state",
x=NULL,
y=NULL,
subtitle="Data through September 2022"
)
dfRatio_cty %>%
filter(name=="deaths") %>%
top_n(25, -delMax*value) %>%
mutate(loss=-delMax*value, lab=paste0(countyFIPS, " (", state, ")")) %>%
select(lab, loss, delMax) %>%
pivot_longer(-lab) %>%
mutate(textLab=ifelse(name=="delMax", paste0(round(value*100, 0), "%"), round(value, 0))) %>%
ggplot(aes(x=fct_reorder(lab, value), y=value)) +
geom_text(aes(label=textLab, hjust=ifelse(name=="delMax", 1, 0))) +
geom_col(fill="lightblue") +
coord_flip() +
facet_wrap(~c("loss"="Magnitude of restatement", "delMax"="Percent lost")[name], scales="free_x") +
labs(x=NULL, y=NULL, title="Counties with significant death restatements", subtitle="Data as of Sep 2022")
dfRatio_cty %>%
filter(name=="cases") %>%
top_n(25, -delMax*value) %>%
mutate(loss=-delMax*value, lab=paste0(countyFIPS, " (", state, ")")) %>%
select(lab, loss, delMax) %>%
pivot_longer(-lab) %>%
mutate(textLab=ifelse(name=="delMax", paste0(round(value*100, 0), "%"), round(value, 0))) %>%
ggplot(aes(x=fct_reorder(lab, value), y=value)) +
geom_text(aes(label=textLab, hjust=ifelse(name=="delMax", 1, 0))) +
geom_col(fill="lightblue") +
coord_flip() +
facet_wrap(~c("loss"="Magnitude of restatement", "delMax"="Percent lost")[name], scales="free_x") +
labs(x=NULL, y=NULL, title="Counties with significant case restatements", subtitle="Data as of Sep 2022")
As observed previously, most states have at most modest restatement in the USAF data
The latest county-level burden data are downloaded:
readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221108.csv",
"usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20221108.csv"
)
compareList <- list("usafCase"=readFromRDS("cty_newdata_20221010")$dfRaw$usafCase,
"usafDeath"=readFromRDS("cty_newdata_20221010")$dfRaw$usafDeath
)
# Use existing clusters
cty_newdata_20221108 <- readRunUSAFacts(maxDate="2022-11-06",
downloadTo=lapply(readList,
FUN=function(x) if(file.exists(x)) NA else x
),
readFrom=readList,
compareFile=compareList,
writeLog="./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log",
ovrwriteLog=TRUE,
useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
skipAssessmentPlots=FALSE,
brewPalette="Paired"
)
## Rows: 3193 Columns: 1020
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (1017): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 28
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 89 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
## Rows: 3193 Columns: 1020
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (1017): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 28
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 38 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType cases new_cases n
## <chr> <dbl> <dbl> <dbl>
## 1 before 3.99e+10 93706741 3244088
## 2 after 3.95e+10 91519270 3192272
## 3 pctchg 9.35e- 3 0.0233 0.0160
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType deaths new_deaths n
## <chr> <dbl> <dbl> <dbl>
## 1 before 5.72e+8 1054379 3244088
## 2 after 5.52e+8 979772 3192272
## 3 pctchg 3.63e-2 0.0708 0.0160
## NULL
# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20221108$useClusters,
caption="Includes only counties with 25k+ population",
brewPalette="viridis"
)
# Save the refreshed file
saveToRDS(cty_newdata_20221108, ovrWriteError=FALSE)
Vaccines data are also updated, though the process will need to integrate previous data:
cty_vaxdata_20221109 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20221109.csv",
ctyList=cty_newdata_20221108,
minDateCD=c("2022-06-09", "2022-06-09"),
maxDateCD="2022-10-29"
)
## Rows: 421046 Columns: 72
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (66): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
## state n
## <chr> <int>
## 1 AS 129
## 2 FM 129
## 3 GU 256
## 4 MH 128
## 5 MP 128
## 6 PR 10127
## 7 PW 128
## 8 VI 512
## 9 <NA> 79
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
##
## Count of NA records by column
## state FIPS popgte65_minpop popgte65_maxpop popgte65_nnA
## 0 0 0 0 0
## n
## 0
##
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## # maxpop <dbl>
## # ℹ Use `colnames()` to see all variable names
##
##
##
## Will run with parameters:
## burdenVar: cpm dpm
## vaxVar: vxcpoppct vxcpoppct
## minDateCD: 2022-06-09 2022-06-09
## maxDateCD: 2022-10-29 2022-10-29
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -295362078 -1799774 81846 2125375 66057276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25179.46 2367.67 10.635 <2e-16 ***
## vaxMetric 46.37 36.60 1.267 0.205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7975000 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.0005136, Adjusted R-squared: 0.0001936
## F-statistic: 1.605 on 1 and 3124 DF, p-value: 0.2053
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -294332768 -1853432 41157 2093372 68302435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 16012.93 8599.68 1.862 0.062692 .
## type>500k 29996.85 5069.49 5.917 3.63e-09 ***
## type100k-500k 21017.91 5053.15 4.159 3.28e-05 ***
## type25k-100k 21759.38 5698.99 3.818 0.000137 ***
## vaxMetric:type<25k 238.15 173.16 1.375 0.169130
## vaxMetric:type>500k -25.74 71.85 -0.358 0.720217
## vaxMetric:type100k-500k 115.45 81.17 1.422 0.155015
## vaxMetric:type25k-100k 122.80 107.56 1.142 0.253656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7977000 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.5659, Adjusted R-squared: 0.5648
## F-statistic: 508 on 8 and 3118 DF, p-value: < 2.2e-16
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3581081 -16937 -1522 23241 702906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 380.7127 30.9273 12.31 <2e-16 ***
## vaxMetric -4.5654 0.4781 -9.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 104200 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.02837, Adjusted R-squared: 0.02806
## F-statistic: 91.2 on 1 and 3124 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3530993 -20262 -5986 17362 719666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 204.9955 111.8143 1.833 0.066845 .
## type>500k 392.6450 65.9142 5.957 2.86e-09 ***
## type100k-500k 178.3898 65.7018 2.715 0.006661 **
## type25k-100k 275.7474 74.0991 3.721 0.000202 ***
## vaxMetric:type<25k -0.2408 2.2514 -0.107 0.914821
## vaxMetric:type>500k -5.0306 0.9342 -5.385 7.78e-08 ***
## vaxMetric:type100k-500k -1.0842 1.0554 -1.027 0.304357
## vaxMetric:type25k-100k -1.9145 1.3985 -1.369 0.171090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103700 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.107, Adjusted R-squared: 0.1047
## F-statistic: 46.69 on 8 and 3118 DF, p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20221109, ovrWriteError=FALSE)
County-level data are post-processed:
cty_postdata_20221108 <- postProcessCountyData(lstCtyBurden=cty_newdata_20221108$dfPerCapita,
lstCtyVax=cty_vaxdata_20221109$vaxFix,
lstState=readFromRDS("cdc_daily_221102")$dfPerCapita,
excludeStates="AK"
)
##
## Parameter maxDate is: 2022-11-01
Additional post-processing steps are re-run:
# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20221108, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).
# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20221108,
p1CompareStates=c("GA", "FL", "NE"),
p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"),
p3VaxTimes=sort(c("2022-01-01", "2022-10-26")),
p4DF=cty_newdata_20221108$dfPerCapita,
excludeStates=c("AK")
)
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
Post-processing will need to be updated to account for changes in data availability in the CDC data file
Memory usage is explored:
# List of files
sapply(ls(), FUN=function(x) object.size(get(x))) %>% sort(decreasing=FALSE)
## usafUpdatedURL usafMainURL
## 168 232
## rawMakeVarMapper problemStates
## 592 624
## readList zeroNA
## 832 840
## fullListMapper zeroPad2
## 848 896
## zeroPad5 pivotMapper
## 896 1024
## checkControlGroupMapper plotSimilarityMapper
## 1056 1056
## glimpseFile checkControlVarsMapper
## 1064 1344
## fileRead uqMapper
## 1344 1400
## perCapMapper urlMapper
## 1408 1488
## lstExcludeMapper combineFiles
## 1544 2072
## vecSelectMapper colSelector
## 2160 2184
## colRenamer glimpseLog
## 2632 2688
## zeroPad customYYYYMM
## 2744 2856
## vecToTibble useStates
## 3192 3248
## genNewLog sumImputedHHS
## 3272 3368
## renMapper helperRollingAgg
## 3488 3584
## lstComboMapper pivotData
## 3624 4256
## fileDownload skinnyHHS
## 4264 4272
## createBurdenCountyDate postProcessCDCDaily
## 4480 4480
## processCountyVaccines printLog
## 4496 4856
## joinFrames getStateData
## 5040 5272
## lagCorrCheck checkUniqueRows
## 5488 5536
## helperPerCapita rowFilter
## 5824 6216
## colMutater dfMinMaxState
## 6272 6536
## dfMinMaxState_v2 getClusters
## 6536 6832
## checkSimilarityMapper onePageCFRPlot
## 6912 6928
## getCountyClusters saveToRDS
## 7280 7608
## lstFilterMapper specSumProd
## 7800 7896
## helperLinePlot clustersToFrame
## 8152 8744
## specNA helperMakePerCapita
## 8888 9016
## checkControl createGroupAgg
## 9192 9296
## getCountyData testImputeNA
## 9608 9632
## integrateData tmpVaxCounts
## 10304 10360
## dfRatio_state createPerCapita
## 10504 10512
## combineAggData imputeNACapacity
## 10864 11016
## pivotStateBurdenData helperSimilarity
## 11120 11448
## plotSimilarity findPeaks
## 11488 11984
## createVaxBurdenData integrateStateVaccine
## 12496 12768
## makeBurdenSummary combineRows
## 13736 13776
## createSummedCountyBurdenData makeCaseHospDeath
## 13968 14560
## checkSimilarity clusterCounties
## 14736 15072
## helperAggTrend selfListMapper
## 17232 17240
## diagnoseClusters processRawFile
## 17472 18480
## flagLargeDelta cumulativeBurdenPlot
## 18992 19728
## helperAggTotal tempStackPlot
## 20448 20736
## hospitalCapacityCDCDaily scoreVaxSimilarity
## 20792 20968
## sparseCountyClusterMap readFromRDS
## 21072 21232
## stateAgeVaxEvolution repairVaxPopulation
## 21560 22264
## downloadCountyVaccines tmpVaxSum
## 22528 24872
## createGeoMap corrVaxBurden
## 24880 25096
## helperElbow keyAggMapper
## 25592 26624
## downloadReadHospitalData plotVaxBurdenData
## 27312 27368
## plotCombineAggByMapper compareAggregate
## 28552 29776
## compareStateSummedCounty filterPopStateAge
## 30616 31616
## hospAgePerCapita scoreSimilarity
## 31632 32152
## plotCFRLag helperSummaryMap
## 32648 32800
## createSummary postProcessCountyData
## 33424 33592
## readQCRawCDCDaily findCorrAlign
## 33816 35104
## readRunCDCDaily cumulativeVaccinePlot
## 35624 36752
## readRunUSAFacts additionalCountyPostProcess
## 38008 38080
## plotHospitalUtilization countyCorr
## 38224 39232
## readQCRawUSAF readPopStateAge
## 42776 43456
## createBurdenPivot tempGetVax
## 44824 45496
## peakValleyCDCDaily makePeakValley
## 46208 50344
## makeBurdenDatePlot clusterStates
## 52216 53696
## bucketPopStateAge createDetailedSummaries
## 54760 77552
## createRestatementData findDeltaFromMax
## 102720 189520
## plotByRestatement keyRatio
## 230688 280600
## decStatus dfMinMax
## 305872 327368
## plotDeltaFromMax anom_cty
## 402320 407360
## dfSegTest dfRatio_cty
## 407360 583032
## dfRatio_cty_chk tmpCountyList
## 583032 591032
## keyRatioDateState keyRatioDateState_v2
## 4695024 4695024
## dfTest dfInter_state
## 5476888 5648248
## dfTest_20221010_state cty_vaxdata_20220914
## 5648248 14110816
## cty_postdata_20220913 cty_postdata_20221010
## 21262688 22494560
## cty_postdata_20221010_v2 cty_postdata_20221108
## 22494560 23122616
## cty_vaxdata_20221109 cty_vaxdata_20221011
## 67665856 71723936
## keyRatioDate compareList
## 337304784 353948352
## dfTest_v2 tmpVaxBurden
## 385465464 390902824
## dfInter_cty dfTest_20221010_cty
## 397530744 397530744
## cty_newdata_20220913 cty_newdata_20221010
## 1351507784 1393594904
## cty_newdata_20221108
## 1432876216
# Function for deleting files and checking memory impact
cleanMem <- function(objs=c(), runGC=TRUE, delObjs=FALSE) {
# FUNCTION ARGUMENTS:
# objs: character vector of objects to be removed
# runGC: boolean, should gc() be run before and after deletion?
# delObjs: boolean, should the objects be deleted?
# Run gc() if requested
if(isTRUE(runGC)) {
cat("\nMemory usage prior to deleting files:\n")
print(gc())
}
# Remove objects that exist, if requested
if(isTRUE(delObjs)) {
objs <- objs[sapply(objs, FUN=function(x) exists(x)) %>% unname]
rm(list=objs, inherits=TRUE)
}
# Run gc() if requested
if(isTRUE(runGC)) {
cat("\nMemory usage after deleting files:\n")
print(gc())
}
}
# Clean large objects
largeObjs <- c("cty_newdata_20220913", "cty_newdata_20221010", "dfTest_20221010_cty", "dfInter_cty")
cleanMem(largeObjs, delObjs=TRUE)
##
## Memory usage prior to deleting files:
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1812274 96.8 13591917 725.9 28735622 1534.7
## Vcells 1012588884 7725.5 1642102704 12528.3 1634910409 12473.4
##
## Memory usage after deleting files:
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1574078 84.1 10873534 580.8 28735622 1534.7
## Vcells 457029447 3486.9 1313682164 10022.6 1634910409 12473.4
The latest county-level burden data are downloaded:
readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221210.csv",
"usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20221210.csv"
)
compareList <- list("usafCase"=readFromRDS("cty_newdata_20221108")$dfRaw$usafCase,
"usafDeath"=readFromRDS("cty_newdata_20221108")$dfRaw$usafDeath
)
# Use existing clusters
cty_newdata_20221210 <- readRunUSAFacts(maxDate="2022-12-08",
downloadTo=lapply(readList,
FUN=function(x) if(file.exists(x)) NA else x
),
readFrom=readList,
compareFile=compareList,
writeLog="./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log",
ovrwriteLog=TRUE,
useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
skipAssessmentPlots=FALSE,
brewPalette="Paired"
)
##
## No file has been downloaded, will use existing file: ./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221210.csv
## Rows: 3193 Columns: 1051
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (1048): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 31
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
## Rows: 3193 Columns: 1051
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (1048): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 31
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType cases new_cases n
## <chr> <dbl> <dbl> <dbl>
## 1 before 4.28e+10 94747847 3343071
## 2 after 4.24e+10 92580585 3289674
## 3 pctchg 1.03e- 2 0.0229 0.0160
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType deaths new_deaths n
## <chr> <dbl> <dbl> <dbl>
## 1 before 6.05e+8 1062085 3343071
## 2 after 5.82e+8 986026 3289674
## 3 pctchg 3.82e-2 0.0716 0.0160
## NULL
# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20221210$useClusters,
caption="Includes only counties with 25k+ population",
brewPalette="viridis"
)
# Save the refreshed file
saveToRDS(cty_newdata_20221210, ovrWriteError=FALSE)
Vaccines data are also updated, though the process will need to integrate previous data:
cty_vaxdata_20221211 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20221211.csv",
ctyList=cty_newdata_20221210,
minDateCD=c("2022-06-09", "2022-06-09"),
maxDateCD="2022-11-26"
)
## Rows: 419908 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (74): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
## state n
## <chr> <int>
## 1 AS 127
## 2 FM 128
## 3 GU 256
## 4 MH 128
## 5 MP 128
## 6 PR 10100
## 7 PW 128
## 8 VI 511
## 9 <NA> 81
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
##
## Count of NA records by column
## state FIPS popgte65_minpop popgte65_maxpop popgte65_nnA
## 0 0 0 0 0
## n
## 0
##
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## # maxpop <dbl>
## # ℹ Use `colnames()` to see all variable names
##
##
##
## Will run with parameters:
## burdenVar: cpm dpm
## vaxVar: vxcpoppct vxcpoppct
## minDateCD: 2022-06-09 2022-06-09
## maxDateCD: 2022-11-26 2022-11-26
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -297120776 -1514524 273465 2356196 66598980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26818.28 2624.44 10.219 <2e-16 ***
## vaxMetric 61.12 40.57 1.507 0.132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8839000 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.0007262, Adjusted R-squared: 0.0004063
## F-statistic: 2.27 on 1 and 3124 DF, p-value: 0.132
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -295965537 -1728713 91978 2150356 65429580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 24826.85 9529.17 2.605 0.009222 **
## type>500k 19240.00 5617.42 3.425 0.000623 ***
## type100k-500k 22482.88 5599.32 4.015 6.08e-05 ***
## type25k-100k 27074.67 6314.96 4.287 1.86e-05 ***
## vaxMetric:type<25k 138.11 191.87 0.720 0.471701
## vaxMetric:type>500k 157.83 79.62 1.982 0.047520 *
## vaxMetric:type100k-500k 136.21 89.94 1.514 0.130025
## vaxMetric:type25k-100k 84.67 119.18 0.710 0.477507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8840000 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.5587, Adjusted R-squared: 0.5575
## F-statistic: 493.4 on 8 and 3118 DF, p-value: < 2.2e-16
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3594180 -16484 1232 27705 699297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 414.195 33.578 12.335 <2e-16 ***
## vaxMetric -4.859 0.519 -9.363 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113100 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.02729, Adjusted R-squared: 0.02698
## F-statistic: 87.66 on 1 and 3124 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3538013 -22248 -5471 20367 731049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 322.9766 121.1872 2.665 0.007736 **
## type>500k 278.1775 71.4396 3.894 0.000101 ***
## type100k-500k 191.0878 71.2093 2.683 0.007325 **
## type25k-100k 335.2734 80.3105 4.175 3.07e-05 ***
## vaxMetric:type<25k -1.7903 2.4401 -0.734 0.463187
## vaxMetric:type>500k -3.3374 1.0125 -3.296 0.000991 ***
## vaxMetric:type100k-500k -0.9895 1.1438 -0.865 0.387057
## vaxMetric:type25k-100k -2.4343 1.5157 -1.606 0.108362
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 112400 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.1185, Adjusted R-squared: 0.1162
## F-statistic: 52.39 on 8 and 3118 DF, p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20221211, ovrWriteError=FALSE)
County-level data are post-processed:
cty_postdata_20221210 <- postProcessCountyData(lstCtyBurden=cty_newdata_20221210$dfPerCapita,
lstCtyVax=cty_vaxdata_20221211$vaxFix,
lstState=readFromRDS("cdc_daily_221202")$dfPerCapita,
excludeStates="AK"
)
##
## Parameter maxDate is: 2022-12-01
Additional post-processing steps are re-run:
# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20221210, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 rows containing missing values (`geom_line()`).
# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20221210,
p1CompareStates=c("GA", "FL", "NE"),
p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"),
p3VaxTimes=sort(c("2022-01-01", "2022-11-30")),
p4DF=cty_newdata_20221210$dfPerCapita,
excludeStates=c("AK")
)
## Warning: Removed 6 rows containing missing values (`geom_line()`).
## Warning: Removed 6 rows containing missing values (`geom_line()`).
## Warning: Removed 6 rows containing missing values (`geom_line()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
Memory is cleaned:
# List of files
sapply(ls(), FUN=function(x) object.size(get(x))) %>% sort(decreasing=FALSE)
## usafUpdatedURL usafMainURL
## 168 232
## largeObjs rawMakeVarMapper
## 384 592
## problemStates readList
## 624 832
## zeroNA fullListMapper
## 840 848
## zeroPad2 zeroPad5
## 896 896
## pivotMapper checkControlGroupMapper
## 1024 1056
## plotSimilarityMapper glimpseFile
## 1056 1064
## checkControlVarsMapper fileRead
## 1344 1344
## uqMapper perCapMapper
## 1400 1408
## urlMapper lstExcludeMapper
## 1488 1544
## combineFiles vecSelectMapper
## 2072 2160
## colSelector colRenamer
## 2184 2632
## glimpseLog zeroPad
## 2688 2744
## customYYYYMM vecToTibble
## 2856 3192
## useStates genNewLog
## 3248 3272
## sumImputedHHS renMapper
## 3368 3488
## helperRollingAgg lstComboMapper
## 3584 3624
## pivotData fileDownload
## 4256 4264
## skinnyHHS createBurdenCountyDate
## 4272 4480
## postProcessCDCDaily processCountyVaccines
## 4480 4496
## printLog joinFrames
## 4856 5040
## getStateData lagCorrCheck
## 5272 5488
## checkUniqueRows helperPerCapita
## 5536 5824
## rowFilter colMutater
## 6216 6272
## dfMinMaxState dfMinMaxState_v2
## 6536 6536
## getClusters checkSimilarityMapper
## 6832 6912
## onePageCFRPlot getCountyClusters
## 6928 7280
## saveToRDS lstFilterMapper
## 7608 7800
## specSumProd helperLinePlot
## 7896 8152
## clustersToFrame specNA
## 8744 8888
## helperMakePerCapita checkControl
## 9016 9192
## createGroupAgg getCountyData
## 9296 9608
## testImputeNA integrateData
## 9632 10304
## tmpVaxCounts dfRatio_state
## 10360 10504
## createPerCapita combineAggData
## 10512 10864
## imputeNACapacity pivotStateBurdenData
## 11016 11120
## helperSimilarity plotSimilarity
## 11448 11488
## findPeaks createVaxBurdenData
## 11984 12496
## integrateStateVaccine makeBurdenSummary
## 12768 13736
## combineRows createSummedCountyBurdenData
## 13776 13968
## makeCaseHospDeath checkSimilarity
## 14560 14736
## clusterCounties helperAggTrend
## 15072 17232
## selfListMapper diagnoseClusters
## 17240 17472
## processRawFile flagLargeDelta
## 18480 18992
## cumulativeBurdenPlot helperAggTotal
## 19728 20448
## tempStackPlot hospitalCapacityCDCDaily
## 20736 20792
## scoreVaxSimilarity sparseCountyClusterMap
## 20968 21072
## readFromRDS stateAgeVaxEvolution
## 21232 21560
## repairVaxPopulation downloadCountyVaccines
## 22264 22528
## cleanMem tmpVaxSum
## 24184 24872
## createGeoMap corrVaxBurden
## 24880 25096
## helperElbow keyAggMapper
## 25592 26624
## downloadReadHospitalData plotVaxBurdenData
## 27312 27368
## plotCombineAggByMapper compareAggregate
## 28552 29776
## compareStateSummedCounty filterPopStateAge
## 30616 31616
## hospAgePerCapita scoreSimilarity
## 31632 32152
## plotCFRLag helperSummaryMap
## 32648 32800
## createSummary postProcessCountyData
## 33424 33592
## readQCRawCDCDaily findCorrAlign
## 33816 35104
## readRunCDCDaily cumulativeVaccinePlot
## 35624 36752
## readRunUSAFacts additionalCountyPostProcess
## 38008 38080
## plotHospitalUtilization countyCorr
## 38224 39232
## readQCRawUSAF readPopStateAge
## 42776 43456
## createBurdenPivot tempGetVax
## 44824 45496
## peakValleyCDCDaily makePeakValley
## 46208 50344
## makeBurdenDatePlot clusterStates
## 52216 53696
## bucketPopStateAge createDetailedSummaries
## 54760 77552
## createRestatementData findDeltaFromMax
## 102720 189520
## plotByRestatement keyRatio
## 230688 280600
## decStatus dfMinMax
## 305872 327368
## plotDeltaFromMax anom_cty
## 402320 407360
## dfSegTest dfRatio_cty
## 407360 583032
## dfRatio_cty_chk tmpCountyList
## 583032 591032
## keyRatioDateState keyRatioDateState_v2
## 4695024 4695024
## dfTest dfInter_state
## 5476888 5648248
## dfTest_20221010_state cty_vaxdata_20220914
## 5648248 14110816
## cty_postdata_20220913 cty_postdata_20221010
## 21262688 22494560
## cty_postdata_20221010_v2 cty_postdata_20221108
## 22494560 23122616
## cty_postdata_20221210 cty_vaxdata_20221211
## 23724656 67487216
## cty_vaxdata_20221109 cty_vaxdata_20221011
## 67665856 71723936
## keyRatioDate compareList
## 337304784 363961600
## dfTest_v2 tmpVaxBurden
## 385465464 390902824
## cty_newdata_20221108 cty_newdata_20221210
## 1432876216 1476366240
# Clean large objects
largeObjs <- c("cty_newdata_20221108", "cty_newdata_20221210", "cty_vaxdata_20221109", "cty_vaxdata_20221211",
"dfTest_v2", "tmpVaxBurden", "compareList", "keyRatioDate"
)
cleanMem(largeObjs, delObjs=TRUE)
##
## Memory usage prior to deleting files:
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1693649 90.5 8698828 464.6 28735622 1534.7
## Vcells 714525595 5451.4 1313682164 10022.6 1634910409 12473.4
##
## Memory usage after deleting files:
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1450593 77.5 6959063 371.7 28735622 1534.7
## Vcells 30487222 232.6 1050945732 8018.1 1634910409 12473.4
# List of files
sapply(ls(), FUN=function(x) object.size(get(x))) %>% sort(decreasing=FALSE)
## usafUpdatedURL usafMainURL
## 168 232
## rawMakeVarMapper problemStates
## 592 624
## largeObjs readList
## 688 832
## zeroNA fullListMapper
## 840 848
## zeroPad2 zeroPad5
## 896 896
## pivotMapper checkControlGroupMapper
## 1024 1056
## plotSimilarityMapper glimpseFile
## 1056 1064
## checkControlVarsMapper fileRead
## 1344 1344
## uqMapper perCapMapper
## 1400 1408
## urlMapper lstExcludeMapper
## 1488 1544
## combineFiles vecSelectMapper
## 2072 2160
## colSelector colRenamer
## 2184 2632
## glimpseLog zeroPad
## 2688 2744
## customYYYYMM vecToTibble
## 2856 3192
## useStates genNewLog
## 3248 3272
## sumImputedHHS renMapper
## 3368 3488
## helperRollingAgg lstComboMapper
## 3584 3624
## pivotData fileDownload
## 4256 4264
## skinnyHHS createBurdenCountyDate
## 4272 4480
## postProcessCDCDaily processCountyVaccines
## 4480 4496
## printLog joinFrames
## 4856 5040
## getStateData lagCorrCheck
## 5272 5488
## checkUniqueRows helperPerCapita
## 5536 5824
## rowFilter colMutater
## 6216 6272
## dfMinMaxState dfMinMaxState_v2
## 6536 6536
## getClusters checkSimilarityMapper
## 6832 6912
## onePageCFRPlot getCountyClusters
## 6928 7280
## saveToRDS lstFilterMapper
## 7608 7800
## specSumProd helperLinePlot
## 7896 8152
## clustersToFrame specNA
## 8744 8888
## helperMakePerCapita checkControl
## 9016 9192
## createGroupAgg getCountyData
## 9296 9608
## testImputeNA integrateData
## 9632 10304
## tmpVaxCounts dfRatio_state
## 10360 10504
## createPerCapita combineAggData
## 10512 10864
## imputeNACapacity pivotStateBurdenData
## 11016 11120
## helperSimilarity plotSimilarity
## 11448 11488
## findPeaks createVaxBurdenData
## 11984 12496
## integrateStateVaccine makeBurdenSummary
## 12768 13736
## combineRows createSummedCountyBurdenData
## 13776 13968
## makeCaseHospDeath checkSimilarity
## 14560 14736
## clusterCounties helperAggTrend
## 15072 17232
## selfListMapper diagnoseClusters
## 17240 17472
## processRawFile flagLargeDelta
## 18480 18992
## cumulativeBurdenPlot helperAggTotal
## 19728 20448
## tempStackPlot hospitalCapacityCDCDaily
## 20736 20792
## scoreVaxSimilarity sparseCountyClusterMap
## 20968 21072
## readFromRDS stateAgeVaxEvolution
## 21232 21560
## repairVaxPopulation downloadCountyVaccines
## 22264 22528
## tmpVaxSum createGeoMap
## 24872 24880
## corrVaxBurden helperElbow
## 25096 25592
## keyAggMapper downloadReadHospitalData
## 26624 27312
## plotVaxBurdenData plotCombineAggByMapper
## 27368 28552
## compareAggregate compareStateSummedCounty
## 29776 30616
## filterPopStateAge hospAgePerCapita
## 31616 31632
## scoreSimilarity plotCFRLag
## 32152 32648
## helperSummaryMap createSummary
## 32800 33424
## postProcessCountyData readQCRawCDCDaily
## 33592 33816
## findCorrAlign readRunCDCDaily
## 35104 35624
## cumulativeVaccinePlot readRunUSAFacts
## 36752 38008
## additionalCountyPostProcess plotHospitalUtilization
## 38080 38224
## countyCorr readQCRawUSAF
## 39232 42776
## readPopStateAge createBurdenPivot
## 43456 44824
## tempGetVax peakValleyCDCDaily
## 45496 46208
## makePeakValley makeBurdenDatePlot
## 50344 52216
## clusterStates bucketPopStateAge
## 53696 54760
## createDetailedSummaries cleanMem
## 77552 91952
## createRestatementData findDeltaFromMax
## 102720 189520
## plotByRestatement keyRatio
## 230688 280600
## decStatus dfMinMax
## 305872 327368
## plotDeltaFromMax anom_cty
## 402320 407360
## dfSegTest dfRatio_cty
## 407360 583032
## dfRatio_cty_chk tmpCountyList
## 583032 591032
## keyRatioDateState keyRatioDateState_v2
## 4695024 4695024
## dfTest dfInter_state
## 5476888 5648248
## dfTest_20221010_state cty_vaxdata_20220914
## 5648248 14110816
## cty_postdata_20220913 cty_postdata_20221010
## 21262688 22494560
## cty_postdata_20221010_v2 cty_postdata_20221108
## 22494560 23122616
## cty_postdata_20221210 cty_vaxdata_20221011
## 23724656 71723936
The latest county-level burden data are downloaded:
readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20230108.csv",
"usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20230108.csv"
)
compareList <- list("usafCase"=readFromRDS("cty_newdata_20221210")$dfRaw$usafCase,
"usafDeath"=readFromRDS("cty_newdata_20221210")$dfRaw$usafDeath
)
# Use existing clusters
cty_newdata_20230108 <- readRunUSAFacts(maxDate="2023-01-06",
downloadTo=lapply(readList,
FUN=function(x) if(file.exists(x)) NA else x
),
readFrom=readList,
compareFile=compareList,
writeLog="./RInputFiles/Coronavirus/USAF_NewData_20230108_chk_v005.log",
ovrwriteLog=TRUE,
useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
skipAssessmentPlots=FALSE,
brewPalette="Paired"
)
##
## No file has been downloaded, will use existing file: ./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20230108.csv
## Rows: 3193 Columns: 1081
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (1078): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 30
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20230108_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 32 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20230108_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 1861 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20230108_chk_v005.log
## Rows: 3193 Columns: 1081
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County Name, State, StateFIPS
## dbl (1078): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS
##
##
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date
##
##
## Checking for similarity of: column names
## In reference but not in current:
## In current but not in reference:
##
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 30
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20230108_chk_v005.log
##
## Checking for similarity of: county
## In reference but not in current:
## In current but not in reference:
##
##
## ***Differences of at least 5 and at least 5%
##
## 26 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20230108_chk_v005.log
##
##
## ***Differences of at least 0 and at least 0.1%
##
## 670 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20230108_chk_v005.log
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType cases new_cases n
## <chr> <dbl> <dbl> <dbl>
## 1 before 4.57e+10 95880716 3438861
## 2 after 4.52e+10 93726971 3383934
## 3 pctchg 1.12e- 2 0.0225 0.0160
##
##
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
## isType deaths new_deaths n
## <chr> <dbl> <dbl> <dbl>
## 1 before 6.37e+8 1068813 3438861
## 2 after 6.12e+8 992220 3383934
## 3 pctchg 3.99e-2 0.0717 0.0160
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## NULL
# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20230108$useClusters,
caption="Includes only counties with 25k+ population",
brewPalette="viridis"
)
# Save the refreshed file
saveToRDS(cty_newdata_20230108, ovrWriteError=FALSE)
Vaccines data are also updated, though the process will need to integrate previous data:
# Updated to remove all_of() from purr::set_name()
countyCorr <- function(dfCaseDeath=NULL,
dfVax=NULL,
dfFull=NULL,
useStates=c(state.abb, "DC"),
minDateCD=NULL,
maxDateCD=NULL,
vaxDate=NULL,
vaxVar="vxcpoppct",
burdenVar="cpm",
showPlots=TRUE,
showReg=isTRUE(showPlots),
returnData=!isTRUE(showPlots)
) {
# FUNCTION ARGUMENTS:
# dfCaseDeath: data frame for county-level cases and deaths (NULL means dfFull passed)
# dfVax: data frame for county-level vaccines (NULL means dfFull passed)
# dfFull: processed data frame that can be used directly
# useStates: states to be included
# minDateCD: earliest date for summing cases and deaths (NULL means use all)
# maxDateCD: latest date for summing cases and deaths (NULL means use all)
# vaxDate: date to use for vaccinated status (NULL means use minDateCD)
# vaxVar: variable to use for vaccinated status by county
# showPlots: boolean, should the plots be created and displayed?
# showReg: boolean, should the regressions be calculated with summaries shown?
# returnData: boolean, should the data be returned?
# Check that either dfFull OR both of dfCaseDeath and dfVax have been passed
if(is.null(dfFull) & (is.null(dfCaseDeath) | is.null(dfVax)))
stop("\nMust pass either dfFull OR both of dfCaseDeath and dfVax\n")
if(!is.null(dfFull) & (!is.null(dfCaseDeath) | !is.null(dfVax)))
cat("\ndfFull will be used; data passed for dfCaseDeath or dfVax is ignored\n")
if(is.null(dfFull)) {
# Get minDateCD and maxDateCD and vaxDate if passed as NULL
if(is.null(minDateCD)) minDateCD <- min(dfCaseDeath$date)
if(is.null(maxDateCD)) maxDateCD <- max(dfCaseDeath$date)
if(is.null(vaxDate)) vaxDate <- minDateCD
# Create the cases and deaths data for analysis
dfCaseDeath <- dfCaseDeath %>%
filter(date >= minDateCD,
date <= maxDateCD,
state %in% all_of(useStates)
) %>%
group_by(fips=countyFIPS) %>%
summarize(across(c(cpm, dpm), sum, na.rm=TRUE), .groups="drop")
# Create the vaccines data for analysis
dfFull <- dfVax %>%
filter(date==vaxDate,
FIPS != "UNK",
state %in% all_of(useStates)
) %>%
select(fips=FIPS, countyName, pop, state, vxcpoppct, vxcgte18pct, vxcgte65pct) %>%
full_join(dfCaseDeath, by="fips") %>%
mutate(type=case_when(pop>=500000 ~ ">500k",
pop>=100000 ~ "100k-500k",
pop>=25000 ~ "25k-100k",
TRUE ~ "<25k"
)
)
}
if(isTRUE(showPlots)) {
# Create the basic plot (should update so dates can be gathered from dfFull)
p1 <- dfFull %>%
ggplot(aes(x=get(vaxVar), y=get(burdenVar))) +
geom_point(aes(size=pop)) +
geom_smooth(aes(weight=pop), method="lm") +
labs(x=paste0("% fully vaccinated on: ", vaxDate, "\n(using variable ", vaxVar, ")"),
y=paste0("Burden per million ", minDateCD, " to ", maxDateCD, "\n(using variable ", burdenVar, ")"),
title="County-level associations between vaccines and burden",
subtitle="Smooth is simple linear model weighted by population"
) +
lims(x=c(0, 100)) +
scale_size_continuous("County\nPop")
# Print the plot and the facetted version
print(p1)
print(p1 + facet_wrap(~type))
}
# Display the regressions (if requested)
if(isTRUE(showReg)) {
dfReg <- dfFull %>% colRenamer("vaxMetric" %>% purrr::set_names(vaxVar))
lm(get(burdenVar) ~ vaxMetric, data=dfReg, weights=pop) %>% summary() %>% print()
lm(get(burdenVar) ~ vaxMetric*type + 0 - vaxMetric, data=dfReg, weights=pop) %>% summary() %>% print()
}
# Return the data frame if requested
if(isTRUE(returnData)) return(dfFull)
}
cty_vaxdata_20230110 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20230110.csv",
ctyList=readFromRDS("cty_newdata_20230108"),
minDateCD=c("2022-06-09", "2022-06-09"),
maxDateCD="2022-12-31"
)
## Rows: 400690 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (74): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
## state n
## <chr> <int>
## 1 AS 122
## 2 FM 122
## 3 GU 244
## 4 MH 122
## 5 MP 122
## 6 PR 9642
## 7 PW 122
## 8 VI 488
## 9 <NA> 81
## Warning: Removed 16 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 16 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 16 rows containing non-finite values (`stat_boxplot()`).
##
## Count of NA records by column
## state FIPS popgte65_minpop popgte65_maxpop popgte65_nnA
## 0 0 0 0 0
## n
## 0
##
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## # maxpop <dbl>
##
##
##
## Will run with parameters:
## burdenVar: cpm dpm
## vaxVar: vxcpoppct vxcpoppct
## minDateCD: 2022-06-09 2022-06-09
## maxDateCD: 2022-12-31 2022-12-31
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 16 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 16 rows containing missing values (`geom_point()`).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -304362706 -1754868 320799 2697810 169325018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27263.79 3171.84 8.596 < 2e-16 ***
## vaxMetric 127.27 49.03 2.596 0.00948 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10680000 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.002152, Adjusted R-squared: 0.001833
## F-statistic: 6.739 on 1 and 3124 DF, p-value: 0.009479
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -303955387 -2037476 -24890 2399119 168597585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 30696.98 11515.04 2.666 0.007720 **
## type>500k 16883.07 6788.09 2.487 0.012929 *
## type100k-500k 24063.52 6766.21 3.556 0.000382 ***
## type25k-100k 27033.05 7630.99 3.543 0.000402 ***
## vaxMetric:type<25k 155.55 231.86 0.671 0.502329
## vaxMetric:type>500k 266.82 96.21 2.773 0.005580 **
## vaxMetric:type100k-500k 174.54 108.68 1.606 0.108390
## vaxMetric:type25k-100k 148.72 144.02 1.033 0.301857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10680000 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.5352, Adjusted R-squared: 0.534
## F-statistic: 448.7 on 8 and 3118 DF, p-value: < 2.2e-16
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 16 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: weight
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 16 rows containing missing values (`geom_point()`).
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3641114 -19626 1751 30524 747147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 455.0622 34.8592 13.054 <2e-16 ***
## vaxMetric -5.0746 0.5388 -9.418 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 117400 on 3124 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.02761, Adjusted R-squared: 0.0273
## F-statistic: 88.69 on 1 and 3124 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric,
## data = dfReg, weights = pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3587593 -26017 -5791 22910 780567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type<25k 358.768 125.815 2.852 0.00438 **
## type>500k 288.980 74.168 3.896 9.97e-05 ***
## type100k-500k 231.022 73.929 3.125 0.00179 **
## type25k-100k 396.486 83.377 4.755 2.07e-06 ***
## vaxMetric:type<25k -1.808 2.533 -0.714 0.47549
## vaxMetric:type>500k -3.125 1.051 -2.973 0.00297 **
## vaxMetric:type100k-500k -1.242 1.188 -1.046 0.29570
## vaxMetric:type25k-100k -2.939 1.574 -1.868 0.06190 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 116700 on 3118 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.1509, Adjusted R-squared: 0.1488
## F-statistic: 69.28 on 8 and 3118 DF, p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20230110, ovrWriteError=FALSE)